{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ETL one cycle\n",
    "> Normally ETL is processed by 3 steps, E, T, L :) but we could do it by one cycle, ETL.\n",
    "\n",
    "We are going to use the 3 configs from `ETL_how_to_run.ipynb` and merge it to one config file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🌌 1. prepare config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "from dataverse.config import Config \n",
    "from omegaconf import OmegaConf\n",
    "\n",
    "# E = Extract, T = Transform, L = Load\n",
    "main_path = Path(os.path.abspath('../..'))\n",
    "ETL_path = main_path / \"./dataverse/config/etl/sample/ETL___one_cycle.yaml\"\n",
    "\n",
    "ETL_config = Config.load(ETL_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🌠 ETL Config\n",
    "> One cycle from raw to huggingface dataset\n",
    "\n",
    "- load fake generation UFL data\n",
    "- sample 10% of total data to reduce the size of dataset\n",
    "- deduplicate by `text` column, 15-gram minhash jaccard similarity\n",
    "- convert to huggingface dataset and return the object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "spark:\n",
      "  appname: dataverse_etl_sample\n",
      "  driver:\n",
      "    memory: 16g\n",
      "etl:\n",
      "- name: data_ingestion___test___generate_fake_ufl\n",
      "- name: utils___sampling___random\n",
      "  args:\n",
      "    sample_n_or_frac: 0.1\n",
      "- name: deduplication___minhash___lsh_jaccard\n",
      "- name: data_load___huggingface___ufl2hf_obj\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(OmegaConf.to_yaml(ETL_config))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🌌 2. put config to ETLPipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataverse.etl import ETLPipeline\n",
    "\n",
    "etl_pipeline = ETLPipeline()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
      "23/11/14 19:01:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "23/11/14 19:01:30 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading and preparing dataset spark/-1276752201 to /root/.cache/huggingface/datasets/spark/-1276752201/0.0.0...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-1276752201/0.0.0. Subsequent calls will reuse this data.\n"
     ]
    }
   ],
   "source": [
    "# raw -> hf_obj\n",
    "spark, dataset = etl_pipeline.run(ETL_config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['id', 'meta', 'name', 'text'],\n",
       "    num_rows: 14\n",
       "})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': 'f073d6d2-20c4-4244-8c5b-e47799840bc7',\n",
       " 'meta': '{\"name\": \"Emma Collins\", \"age\": 37, \"address\": \"4980 Thompson Plains\\\\nSouth Kimberly, NV 38999\", \"job\": \"Phytotherapist\"}',\n",
       " 'name': 'test_fake_ufl',\n",
       " 'text': 'Agency tend rock teacher body collection spend. Thing surface close pretty.'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset[0]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llm",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
